{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"align-center\">\n",
    "<a href=\"https://oumi.ai/\"><img src=\"https://oumi.ai/docs/en/latest/_static/logo/header_logo.png\" height=\"200\"></a>\n",
    "\n",
    "[![Documentation](https://img.shields.io/badge/Documentation-latest-blue.svg)](https://oumi.ai/docs/en/latest/index.html)\n",
    "[![Discord](https://img.shields.io/discord/1286348126797430814?label=Discord)](https://discord.gg/oumi)\n",
    "[![GitHub Repo stars](https://img.shields.io/github/stars/oumi-ai/oumi)](https://github.com/oumi-ai/oumi)\n",
    "<a target=\"_blank\" href=\"https://colab.research.google.com/github/oumi-ai/oumi/blob/main/notebooks/Oumi - Evaluation with AlpacaEval 2.0.ipynb\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>\n",
    "</div>\n",
    "\n",
    "👋 Welcome to Open Universal Machine Intelligence (Oumi)!\n",
    "\n",
    "🚀 Oumi is a fully open-source platform that streamlines the entire lifecycle of foundation models - from [data preparation](https://oumi.ai/docs/en/latest/resources/datasets/datasets.html) and [training](https://oumi.ai/docs/en/latest/user_guides/train/train.html) to [evaluation](https://oumi.ai/docs/en/latest/user_guides/evaluate/evaluate.html) and [deployment](https://oumi.ai/docs/en/latest/user_guides/launch/launch.html). Whether you're developing on a laptop, launching large scale experiments on a cluster, or deploying models in production, Oumi provides the tools and workflows you need.\n",
    "\n",
    "🤝 Make sure to join our [Discord community](https://discord.gg/oumi) to get help, share your experiences, and contribute to the project! If you are interested in joining one of the community's open-science efforts, check out our [open collaboration](https://oumi.ai/community) page.\n",
    "\n",
    "⭐ If you like Oumi and you would like to support it, please give it a star on [GitHub](https://github.com/oumi-ai/oumi)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Evaluation of a Hallucination Classifier \n",
    "\n",
    "This notebook demonstrates how to evaluate a Large Language Model (LLM) classifier using Oumi's custom evaluation. In particular, we focus on a binary hallucination classifier for open-book question-answering (Q&A). Given a context document and a hypothesis, the classifier’s task is to determine whether the hypothesis is grounded in the context (i.e., not hallucinated). The context can be thought of as a text snippet retrieved using Retrieval-Augmented Generation (RAG), while the hypothesis represents an LLM-generated response. \n",
    "\n",
    "The underlying model used for this assessment is a generative LLM, which produces responses in free text. As a result, the prompt must explicitly specify the desired format for the response, and post-processing logic is needed to extract the prediction from the model's free-form response."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Prerequisites and Environment\n",
    "\n",
    "❗**NOTICE:** We recommend running this notebook on a GPU. If running on Google Colab, you can use the free T4 GPU runtime (Colab Menu: `Runtime` -> `Change runtime type`).\n",
    "\n",
    "### Oumi Installation\n",
    "\n",
    "First, let's install Oumi. You can find more detailed instructions about Oumi installation [here](https://oumi.ai/docs/en/latest/get_started/installation.html)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "%pip install oumi"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Tutorial Directory Setup\n",
    "\n",
    "Next, we will create a directory for the tutorial, to store the evaluation configuration and the experimental results."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "from pathlib import Path\n",
    "\n",
    "tutorial_dir = \"custom_eval_tutorial\"\n",
    "\n",
    "Path(tutorial_dir).mkdir(parents=True, exist_ok=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Limit the Number of Examples\n",
    "\n",
    "This notebook limits the total number of examples to 10, for testing purposes. The dataset we will use (ANLI) contains 1,200 examples. To use the full dataset, set the number of examples to `None`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "NUM_EXAMPLES = 10  # Replace with `None` for full dataset evaluation."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Evaluation Dataset\n",
    "\n",
    "### Base Data: ANLI\n",
    "\n",
    "We will use the [Adversarial Natural Language Inference (ANLI)](https://arxiv.org/abs/1910.14599) dataset by Meta for our evaluations. This is an adversarial NLI dataset collected via an iterative, adversarial human-and-model-in-the-loop procedure. It consists of a `context` (premise), a `hypothesis`, and a `label`. The `context` is a short snippet of text that is considered the \"ground truth\", serving as our reference to validate whether a hypothesis is supported or not. The `hypothesis` is a claim that should be grounded based on the `context`, while the `label` indicates whether the hypothesis is a valid entailment (`0`), neutral (`1`), or a contradiction (`2`), based on the `context`. For our experimental analysis, if the hypothesis is not a valid entailment, we consider it hallucinated."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{'uid': 'b0e63408-53af-4b46-b33d-bf5ba302949f', 'premise': \"It is Sunday today, let's take a look at the most popular posts of the last couple of days. Most of the articles this week deal with the iPhone, its future version called the iPhone 8 or iPhone Edition, and new builds of iOS and macOS. There are also some posts that deal with the iPhone rival called the Galaxy S8 and some other interesting stories. The list of the most interesting articles is available below. Stay tuned for more rumors and don't forget to follow us on Twitter.\", 'hypothesis': 'The day of the passage is usually when Christians praise the lord together', 'label': 0, 'reason': \"Sunday is considered Lord's Day\"}\n"
     ]
    }
   ],
   "source": [
    "import datasets\n",
    "\n",
    "anli_dataset = datasets.load_dataset(\"facebook/anli\", split=\"test_r3\")\n",
    "\n",
    "print(anli_dataset[0])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Designing the Prompt\n",
    "\n",
    "For ANLI to be understood by an LLM, we need to covert the list of (`context`, `hypothesis`) tuples to a list of prompts, providing the model with an actionable request. When designing LLM prompts, each prompt typically consists of a static part and a dynamic part (the actual content). The static text describes the input and output formats to the model, as well as the criteria to generate the desired answer. The content is the `context` and `hypothesis` fields coming from ANLI. Below, you can see the template we used to generate the prompts for our dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "prompt_template = \"\"\"You will be given a premise and a hypothesis.\n",
    "Determine if the hypothesis is <|supported|> or <|unsupported|> based on the premise.\n",
    "\n",
    "premise: {premise}\n",
    "\n",
    "hypothesis: {hypothesis}\n",
    "\n",
    "You are allowed to think out loud.\n",
    "Ensure that your final answer is formatted as <|supported|> or <|unsupported|>.\n",
    "\"\"\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Creating the Evaluation Dataset\n",
    "\n",
    "The code snippet below demonstrates how we convert the ANLI dataset to an evaluation dataset (a list of Oumi `Conversation`s). Each conversation consists of a `conversation_id` (which is set to ANLI's `uid`), a single user `message` (added in the `messages` list) which contains the prompt, and metadata. We retain the ANLI label in the conversations' metadata, after we convert it to a binary label: `0` is corresponding to a supported hypothesis, `1` is corresponding to an unsupported hypothesis (hallucination)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[2025-03-12 11:09:11,968][oumi][rank0][pid:90050][MainThread][INFO]][base_map_dataset.py:91] Creating map dataset (type: TextSftJsonLinesDataset)... dataset_name: 'custom'\n"
     ]
    }
   ],
   "source": [
    "from oumi.datasets import TextSftJsonLinesDataset\n",
    "\n",
    "\n",
    "# Convert the ANLI label to a binary label.\n",
    "def _convert_label_to_binary(label):\n",
    "    if label == 0:\n",
    "        return 0  # supported\n",
    "    elif label in [1, 2]:\n",
    "        return 1  # unsupported (hallucination)\n",
    "    else:\n",
    "        raise ValueError(f\"Invalid label: {label}\")\n",
    "\n",
    "\n",
    "# Convert the ANLI dataset into a list of conversations (oumi.core.types.Conversation).\n",
    "evaluation_dataset = []\n",
    "for anli_example in anli_dataset:\n",
    "    prompt = prompt_template.format(\n",
    "        premise=anli_example[\"premise\"], hypothesis=anli_example[\"hypothesis\"]\n",
    "    )\n",
    "    message = {\"role\": \"user\", \"content\": prompt}\n",
    "    metadata = {\"label\": _convert_label_to_binary(anli_example[\"label\"])}\n",
    "    conversation = {\n",
    "        \"conversation_id\": anli_example[\"uid\"],\n",
    "        \"messages\": [message],\n",
    "        \"metadata\": metadata,\n",
    "    }\n",
    "    evaluation_dataset.append(conversation)\n",
    "\n",
    "# Limit the number of examples, if requested.\n",
    "if NUM_EXAMPLES:\n",
    "    evaluation_dataset = evaluation_dataset[:NUM_EXAMPLES]\n",
    "\n",
    "# Read the dataset into an Oumi class.\n",
    "MY_DATASET = TextSftJsonLinesDataset(data=evaluation_dataset)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Evaluation\n",
    "\n",
    "This section discusses how we can leverage Oumi's custom evaluation to run inference on the prompts, extract the predictions from the model responses, and finally calculate metrics to assess our classification quality."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Step 1: Specify How to Extract the Prediction\n",
    "\n",
    "Given our prompt design (discussed earlier in this notebook), the response is expected to contain either the `<|supported|>` or `<|unsupported|>` tag. The simplest extraction we can perform is look up for either of these tags in the response. Note that if both or none of the tags is included in the response, that indicates that our model's assessment is ambiguous. \n",
    "\n",
    "The function below returns `0` if the hypothesis is supported, `1` if the hypothesis is NOT supported, and `-1` when the LLM's response is ambiguous (i.e., no assessment is available)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "def _extract_prediction(response):\n",
    "    is_unsupported = \"<|unsupported|>\" in response\n",
    "    is_supported = \"<|supported|>\" in response\n",
    "    if is_unsupported == is_supported:\n",
    "        return -1\n",
    "    return 0 if is_supported else 1"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Step 2: Define the Custom Evaluation Function\n",
    "\n",
    "Our evaluation framework supports custom evaluation functions, which can be registered with the decorator `@register_evaluation_function`, and retrieved in our evaluation config by directly referencing the registered function name (`my_custom_evaluation`) in the `task_name` (see `YAML` code in Step 3).\n",
    "\n",
    "Below, we define a custom evaluation function and register it as `my_custom_evaluation`. The evaluation function first runs inference, using the `inference_engine` that has been instantiated based on the `YAML` configuration (`inference_engine: NATIVE`; see Step 3). During inference, the engine will append the LLM's (assistant's) response into each conversation, as its last message. Thus, we extract the predictions from the last messages and store them in a list (`y_pred`). We also store the corresponding labels in a list (`y_true`). The labels can be found in the conversations' metadata, as discussed earlier (see section: Create the Evaluation Dataset).\n",
    "\n",
    "Once we have the lists of predictions (`y_pred`) and labels (`y_true`), we can compute any relevant metrics (F1, Precision, Recall, BACC, etc). In this notebook we choose the Balanced Accuracy (BACC) from the sklearn library. The final step is to return all computed metrics in a dict.\n",
    "\n",
    "Note that registering an evaluation function has been designed with flexibility as the primary goal. There are **no requirements on what inputs to use**; for example: whether to use Oumi's inference engine, or leverage the input dataset. These are only provided for user convenience. In practice, you can **use your own inference** or not use one. You also don't have to pass a dataset into this function; you can **read data from any source you want**: a file, a database, your custom function, or not use a dataset during evaluation. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.metrics import balanced_accuracy_score\n",
    "\n",
    "from oumi.core.registry import register_evaluation_function\n",
    "\n",
    "\n",
    "@register_evaluation_function(\"my_custom_evaluation\")\n",
    "def my_custom_evaluation(inference_engine, dataset):\n",
    "    \"\"\"Custom evaluation function registered as `my_custom_evaluation`.\"\"\"\n",
    "    # Run inference to generate the model responses.\n",
    "    conversations = inference_engine.infer(dataset.conversations())\n",
    "\n",
    "    y_true, y_pred = [], []\n",
    "    for conversation in conversations:\n",
    "        # Extract the assistant's (LLM's) response from the conversation.\n",
    "        response = conversation.last_message()\n",
    "\n",
    "        # Extract the prediction from the response.\n",
    "        prediction = _extract_prediction(response.content)\n",
    "\n",
    "        # Record the valid predictions (!= -1) together with their labels.\n",
    "        if prediction != -1:\n",
    "            y_pred.append(prediction)\n",
    "            y_true.append(conversation.metadata[\"label\"])\n",
    "\n",
    "    # Compute any relevant metrics (such as Balanced Accuracy).\n",
    "    bacc = balanced_accuracy_score(y_true, y_pred)\n",
    "\n",
    "    return {\"bacc\": bacc}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Step 3: Set the Evaluation Configuration\n",
    "\n",
    "The easiest way to configure our custom evaluation is by setting a `YAML` file, as shown below. This file identifies the model to evaluate, sets our generation parameters, the inference engine to use, and the output directory. Most importantly, it points to a single evaluation task (under `tasks`) describing the custom task to run:\n",
    "- `evaluation_backend`: `custom`, indicating a custom evaluation.\n",
    "- `task_name`: `my_custom_evaluation`, pointing to the name of the custom evaluation function that we registered (see Step 2).\n",
    "\n",
    "Note: Evaluating **any open or closed source model** is a **simple config change**!\n",
    "- This example is evaluating SmolLM2; a small (135B) model that we use for testing purposes. \n",
    "- For open-source or custom models, just set the `model_name` to your desired HuggingFace model ID, or point to a local path. \n",
    "- For closed-source models, besides setting the `model_name` (e.g. `gpt-4o`), the only additional things you need to do is to set the `inference_engine` to the [relevant type](https://oumi.ai/docs/en/latest/user_guides/infer/configuration.html#engine-selection) (`OPENAI`, `ANTHROPIC`, `GOOGLE_GEMINI`, `DEEPSEEK`, etc) and set an API key (if needed) in [remote params](https://oumi.ai/docs/en/latest/user_guides/infer/configuration.html#remote-configuration). Visit [our documentation](https://oumi.ai/docs/en/latest/user_guides/infer/inference_engines.html) for more details. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "yaml_content = \"\"\"\n",
    "model:\n",
    "  model_name: \"HuggingFaceTB/SmolLM2-135M-Instruct\"\n",
    "  model_max_length: 2048\n",
    "  torch_dtype_str: \"bfloat16\"\n",
    "  trust_remote_code: True\n",
    "\n",
    "generation:\n",
    "  batch_size: 1\n",
    "\n",
    "tasks:\n",
    "  # List of tasks to evaluate (only a single task in this case).\n",
    "  - evaluation_backend: custom\n",
    "    task_name: my_custom_evaluation\n",
    "\n",
    "inference_engine: NATIVE\n",
    "\n",
    "output_dir: \"tutorial/output\"\n",
    "\"\"\"\n",
    "\n",
    "with open(f\"{tutorial_dir}/custom_eval.yaml\", \"w\") as f:\n",
    "    f.write(yaml_content)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Step 4: Run the Evaluation\n",
    "\n",
    "Putting everything together: Once we have constructed the dataset, registered an evaluation function, and set the evaluation config, we can run the evaluation as shown below."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[2025-03-12 11:09:36,709][oumi][rank0][pid:90050][MainThread][INFO]][models.py:208] Building model using device_map: auto (DeviceRankInfo(world_size=1, rank=0, local_world_size=1, local_rank=0))...\n",
      "[2025-03-12 11:09:36,850][oumi][rank0][pid:90050][MainThread][INFO]][models.py:276] Using model class: <class 'transformers.models.auto.modeling_auto.AutoModelForCausalLM'> to instantiate model.\n",
      "[2025-03-12 11:09:37,497][oumi][rank0][pid:90050][MainThread][INFO]][models.py:482] Using the model's built-in chat template for model 'HuggingFaceTB/SmolLM2-135M-Instruct'.\n",
      "[2025-03-12 11:09:37,518][oumi][rank0][pid:90050][MainThread][INFO]][native_text_inference_engine.py:140] Setting EOS token id to `2`\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Generating Model Responses: 100%|██████████| 10/10 [00:12<00:00,  1.21s/it]\n"
     ]
    }
   ],
   "source": [
    "from oumi.core.configs import EvaluationConfig\n",
    "from oumi.core.evaluation import Evaluator\n",
    "\n",
    "config = EvaluationConfig.from_yaml(str(Path(tutorial_dir) / \"custom_eval.yaml\"))\n",
    "\n",
    "evaluator = Evaluator()\n",
    "results = evaluator.evaluate(config, dataset=MY_DATASET)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Step 5: Inspect the Results\n",
    "\n",
    "The code below shows how we can inspect the list of `results`. Here, we have a single result returned by the custom evaluation function, since our evaluation config only defined one task. \n",
    "\n",
    "We note that though BACC is perfect (`1.0`), SmolLM2 135M is actually NOT a great hallucination classifier; the positive outcome is merely because we are only sampling 10 examples. We selected SmolLM2 in this notebook because of its small size, to optimize for execution speed. We strongly recommend larger models for accurate classifications."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "BACC: 1.0\n",
      "Execution duration in sec: 12\n"
     ]
    }
   ],
   "source": [
    "custom_task_results: dict = results[0].get_results()\n",
    "\n",
    "print(\"BACC:\", custom_task_results[\"bacc\"])\n",
    "print(\"Execution duration in sec:\", results[0].elapsed_time_sec)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 🧭 What's Next?\n",
    "\n",
    "Congrats on finishing this notebook! Feel free to check out our other [notebooks](https://github.com/oumi-ai/oumi/tree/main/notebooks) in the [Oumi GitHub](https://github.com/oumi-ai/oumi), and give us a star! You can also join the Oumi community over on [Discord](https://discord.gg/oumi).\n",
    "\n",
    "📰 Want to keep up with news from Oumi? Subscribe to our [Substack](https://blog.oumi.ai/) and [Youtube](https://www.youtube.com/@Oumi_AI)!\n",
    "\n",
    "⚡ Interested in building custom AI in hours, not months? Apply to get [early access](https://oumi-ai.typeform.com/early-access) to the Oumi Platform, or [chat with us](https://calendly.com/d/ctcx-nps-47m/chat-with-us-get-early-access-to-the-oumi-platform) to learn more!"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "oumi",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.8"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
